Preparing Korean Data for the Shared Task on Parsing Morphologically Rich Languages
Abstract
This document gives a brief description of the Korean data prepared for the SPMRL 2013 shared task (Seddah et al., 2013). A total of 27,363 sentences with 350,090 tokens are used for the shared task. All constituent trees are collected from the KAIST Treebank and transformed to the Penn Treebank style. All dependency trees are converted from the transformed constituent trees using heuristics and labeling rules designed specifically for the KAIST Treebank. In addition to the gold-standard morphological analysis provided by the KAIST Treebank, two sets of automatic morphological analysis are provided for the shared task: one generated by the HanNanum morphological analyzer and the other by the Sejong morphological analyzer.

1 Constituent Treebank

All constituent trees are collected from the KAIST Treebank (Choi et al., 1994). The KAIST Treebank contains about 31K manually annotated constituent trees from 97 different sources (e.g., newspapers, novels, textbooks). After filtering out trees with annotation errors, a total of 27,363 trees with 350,090 tokens are collected. Table 1 shows the distribution of the training, development, and evaluation sets used for the shared task.

          Train      Develop    Evaluate
Trees     23,010     2,066      2,287
Tokens    296,446    25,278     28,366

Table 1: Distributions of the training, development, and evaluation sets used for the shared task.

Constituent trees in the KAIST Treebank also come with manually inspected morphological analysis based on the 'eojeol'. An eojeol contains root forms of word tokens agglutinated with grammatical affixes (e.g., case particles, ending markers). An eojeol can consist of more than one word token; for instance, the compound noun "bus stop" is often represented as one eojeol in Korean, 버스정류장, which can be broken into two word tokens, 버스 (bus) and 정류장 (stop). Eojeols in the KAIST Treebank are separated by white space regardless of punctuation.
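The split sizes in Table 1 can be checked against the reported totals with a few lines of arithmetic. The following is a minimal sketch; the dictionary name and layout are illustrative assumptions, and only the numbers come from the text.

```python
# Split sizes copied from Table 1; the `splits` structure itself is hypothetical.
splits = {
    "train": {"trees": 23010, "tokens": 296446},
    "develop": {"trees": 2066, "tokens": 25278},
    "evaluate": {"trees": 2287, "tokens": 28366},
}

# Sum each column across the three splits.
total_trees = sum(s["trees"] for s in splits.values())
total_tokens = sum(s["tokens"] for s in splits.values())

print(total_trees)   # 27363, matching the reported number of trees
print(total_tokens)  # 350090, matching the reported number of tokens

# Average number of tokens per tree over the whole data set.
print(round(total_tokens / total_trees, 2))
```

Both column sums agree with the totals of 27,363 trees and 350,090 tokens stated in the text.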
Figure 1 shows the morphological analysis of the sentence "I drank cognac." in Korean.

[Figure 1: the Korean rows were lost in extraction; the surviving English glosses are given here.]
Word gloss:      I  cognac  drank
Morpheme gloss:  I+tpc  cognac+(+Cognac+)+obj  drink+past+final+.

Figure 1: Morphological analysis of the sentence "I drank cognac." in Korean, where each morpheme is separated by a plus sign (+). tpc: topical auxiliary, obj: objective case particle, past: past-tense ending marker, final: final ending marker.
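The notation in Figure 1 (eojeols separated by white space, morphemes separated by plus signs) can be read programmatically. The following is a minimal sketch applied to the English gloss line; the function name `split_eojeols` is hypothetical, and the actual file format of the shared-task data may differ.

```python
def split_eojeols(analysis: str) -> list[list[str]]:
    """Split an analysis line into eojeols, then each eojeol into morphemes.

    Assumes white space separates eojeols and '+' separates morphemes,
    as in Figure 1. A morpheme that is itself a literal '+' would be
    dropped by this simple approach.
    """
    return [
        [morpheme for morpheme in eojeol.split("+") if morpheme]
        for eojeol in analysis.split()
    ]


gloss = "I+tpc cognac+(+Cognac+)+obj drink+past+final+."
for morphemes in split_eojeols(gloss):
    print(morphemes)
```

Note that the parentheses and the sentence-final period are treated as separate morphemes inside their eojeol, which is consistent with eojeols being delimited by white space regardless of punctuation.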
Similar resources
Introducing the SPMRL 2014 Shared Task on Parsing Morphologically-rich Languages
This first joint meeting on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical English (SPMRL-SANCL) featured a shared task on statistical parsing of morphologically rich languages (SPMRL). The goal of the shared task is to allow to train and test different participating systems on comparable data sets, thus providing an objective measure of comparison...
The First Workshop on Statistical Parsing of Morphologically Rich Languages
The term Morphologically Rich Languages (MRLs) refers to languages in which significant information concerning syntactic units and relations is expressed at word-level. There is ample evidence that the application of readily available statistical parsing models to such languages is susceptible to serious performance degradation. The first workshop on statistical parsing of MRLs hosts a variety o...
Effective Morphological Feature Selection with MaltOptimizer at the SPMRL 2013 Shared Task
The inclusion of morphological features provides very useful information that helps to enhance the results when parsing morphologically rich languages. MaltOptimizer is a tool, that given a data set, searches for the optimal parameters, parsing algorithm and optimal feature set achieving the best results that it can find for parsers trained with MaltParser. In this paper, we present an extensio...
Overview of the SPMRL 2013 Shared Task: A Cross-Framework Evaluation of Parsing Morphologically Rich Languages
This paper reports on the first shared task on statistical parsing of morphologically rich languages (MRLs). The task features data sets from nine languages, each available both in constituency and dependency annotation. We report on the preparation of the data sets, on the proposed parsing scenarios, and on the evaluation metrics for parsing MRLs given different representation types. We presen...
The Effect of Morphemes on Dependency Parsing of Persian
Data-driven systems can be adapted to different languages and domains easily. Applying this trend to dependency parsing led to the introduction of data-driven approaches. The existence of appropriate corpora that contain sentences and their associated dependency trees is the only prerequisite for data-driven approaches. Despite obtaining highly accurate results for the dependency parsing task in English langu...
Journal title:
- CoRR
Volume abs/1309.1649, issue -
Pages -
Publication date 2013